Efficiency of short loop instruction fetch

ABSTRACT

A method, system and computer program product for instruction fetching within a processor instruction unit, utilizing a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. During instruction fetch, modified instruction buffers coupled to an instruction cache (I-cache) temporarily store instructions from a single branch, backwards short loop. The modified instruction buffers may be a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. Instructions are stored in the modified instruction buffers for the length of the loop cycle. The instruction fetch within the instruction unit of a processor retrieves the instructions for the short loop from the modified buffers during the loop cycle, rather than from the instruction cache.

CROSS-REFERENCE TO RELATED APPLICATION

The present application claims priority under 35 U.S.C. §120 as a continuation application to application Ser. No. 11/923,709, entitled “Apparatus and Method for Improving Efficiency of Short Loop Instruction Fetch,” filed on Oct. 25, 2007, which is hereby incorporated by reference in its entirety.

BACKGROUND

1. Technical Field

The present invention generally relates to microprocessors and in particular to a technique for enhancing operations within a microprocessor.

2. Description of the Related Art

A microprocessor is a digital device that executes instructions specified by a computer program. A typical computer system includes a microprocessor coupled to a system memory that stores program instructions and data to be processed by the program instructions. One of the primary steps in executing instructions in a microprocessor involves fetching instructions from a cache. The majority of microprocessors possess caches which store instructions and allow rapid fetching of those instructions without having to access the main memory. As microprocessors become smaller and faster, there is a need to improve the efficiency of the instruction fetch.

Several problems exist with the current method of instruction fetch from the instruction cache of a microprocessor. As an example, backward taken branch loops, such as “for” loops and “while” loops, are common short loop constructs that frequent the instruction cache (I-cache). The for loop allows code to be executed repeatedly, often executing for a definite number of loop counts. While loops, also executing repeatedly, are conditional and based on the outcome of a sequential instruction. For each of the backward taken branch loop commands and the corresponding repeats, the I-cache is accessed repeatedly, even though the entire loop resides in the instruction buffer (IBUF).

Frequently accessing the I-cache with for and while loops, also known as short loops, increases device power consumption. As devices become smaller and more portable, lower power consumption is an important factor in microprocessor design. Repeated utilization of the I-cache for short loops increases energy consumption.

Repeated access to the I-cache for short loops may also cause instruction delays. For example, during an instruction fetch, delays may occur if the instruction cache is busy. Also, the fetch logic must arbitrate to access the I-cache, whether there is one thread or multiple threads. In all these cases, increased latency can significantly degrade the efficiency of the microprocessor.

SUMMARY OF ILLUSTRATIVE EMBODIMENTS

Disclosed are a method, system and computer program product for instruction fetching within a processor instruction unit, utilizing a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. During instruction fetch, modified instruction buffers coupled to an instruction cache (I-cache) temporarily store instructions from a single branch, backwards short loop. The modified instruction buffers may be a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. Instructions are stored in the modified instruction buffers for the length of the loop cycle. The instruction fetch within the instruction unit of a processor retrieves the instructions for the short loop from the modified buffers during the loop cycle, rather than from the instruction cache. Retrieving the instructions from the modified instruction buffers (a) reduces power usage (or energy consumption) by eliminating repeat accesses to the I-cache and (b) increases processor performance by freeing the I-cache for processing new instructions.

In one embodiment, a loop buffer is coupled to instruction buffers to store and retrieve instructions from a single branch, backwards short loop. The process may be performed in single thread mode or simultaneous multi-thread (SMT) mode. The instruction loop is detected and analyzed to calculate the number of loops the instructions will cycle. After the instructions are loaded into the instruction buffer, the instruction fetch cycles through the loop buffer instead of the I-cache to obtain the instructions. When the cycle for the single branch, backwards short loop is complete, the instruction fetch returns to processing data from the I-cache.

In one embodiment, the invention utilizes virtual loop buffers (VLB) to store instructions from a single branch, backwards short loop in single thread mode. Virtual loop buffers are added to instruction buffers coupled to an I-cache. When a single branch, backwards short loop is detected, instruction lengths less than or equal to the capacity of the VLB(s) are loaded into the instruction buffers. Once loaded into the instruction buffers, the instructions are distributed to the VLB(s). Instructions are fetched from the VLB(s) until all cycles within the loop are complete. In single thread mode, after completing the cycle, the instruction unit returns to performing the instruction fetch from the I-cache until another single branch, backwards short loop is detected.

The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention itself, as well as a preferred mode of use, further objects, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a block diagram of a microprocessor chip within a data processing system, according to one embodiment of the invention;

FIG. 2 is a block diagram of microprocessor components in accordance with one embodiment of the invention;

FIG. 3 is a diagram depicting instruction buffer enhancement with a loop buffer according to one embodiment of the invention;

FIG. 4 is a diagram depicting instruction buffer enhancement with virtual loop buffers in accordance with one embodiment of the invention;

FIG. 5 is a diagram depicting instruction buffer enhancement utilizing a loop sequence queue according to one embodiment of the invention;

FIG. 6 is a logic flow chart of the process of short loop instruction buffer enhancement utilizing a loop buffer in accordance with one embodiment of the invention;

FIG. 7 is a logic flow chart of the process of short loop instruction buffer enhancement utilizing virtual loop buffers according to one embodiment of the invention; and

FIG. 8 is a logic flow chart of the process of short loop instruction buffer enhancement utilizing a register file instruction loop buffer in accordance with one embodiment of the invention.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

The illustrative embodiments provide a method, system and computer program product for instruction fetching within a processor instruction unit, utilizing a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. During instruction fetch, modified instruction buffers coupled to an instruction cache (I-cache) temporarily store instructions from a single branch, backwards short loop. The modified instruction buffers may be a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. Instructions are stored in the modified instruction buffers for the length of the loop cycle. The instruction fetch within the instruction unit of a processor retrieves the instructions for the short loop from the modified buffers during the loop cycle, rather than from the instruction cache.

In the following detailed description of exemplary embodiments of the invention, specific exemplary embodiments in which the invention may be practiced are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, architectural, programmatic, mechanical, electrical and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

Within the descriptions of the figures, similar elements are provided similar names and reference numerals as those of the previous figure(s). Where a later figure utilizes the element in a different context or with different functionality, the element is provided a different leading numeral representative of the figure number (e.g., 1xx for FIG. 1 and 2xx for FIG. 2). The specific numerals assigned to the elements are provided solely to aid in the description and are not meant to imply any limitations (structural or functional) on the invention.

It is understood that the use of specific component, device and/or parameter names is for example only and is not meant to imply any limitations on the invention. The invention may thus be implemented with different nomenclature/terminology utilized to describe the components/devices/parameters herein, without limitation. Each term utilized herein is to be given its broadest interpretation given the context in which that term is utilized.

With reference now to the figures, FIG. 1 depicts a block diagram representation of a microprocessor chip within a data processing system 150. Microprocessor chip 100 comprises microprocessor cores 102a, 102b. Microprocessor cores 102a, 102b utilize instruction cache (I-cache) 104 and data cache (D-cache) 106 as a buffer memory between external memory and microprocessor cores 102a, 102b. I-cache 104 and D-cache 106 are level 1 (L1) caches, which are coupled to shared level 2 (L2) cache 118. L2 cache 118 operates as a memory cache, external to microprocessor cores 102a, 102b. L2 cache 118 is coupled to memory controller 122. Memory controller 122 is configured to manage the transfer of data between L2 cache 118 and main memory 126. Microprocessor chip 100 may also include level 3 (L3) directory 120. L3 directory 120 provides on-chip access to off-chip L3 cache 124. L3 cache 124 may be additional dynamic random access memory.

Those of ordinary skill in the art will appreciate that the hardware and basic configuration depicted in FIG. 1 may vary. For example, other devices/components may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.

With reference now to FIG. 2, there are illustrated the major functional components of microprocessor chip 100 utilized in instruction fetching. In the described embodiments, microprocessor cores 102a, 102b (FIG. 1) serve as the primary processing units in microprocessor chip 100.

Instruction fetching is controlled by instruction unit 202. Instruction unit 202 comprises branch execution unit (BEU) 224, which utilizes instruction fetch 206 to initially obtain instructions from I-cache 204. I-cache 204 resides in the instruction unit 202 of processor core 200. The fetched instructions are placed in IBUF 1 208, IBUF 2 210, IBUF 3 212, or IBUF 4 214. Instructions from I-cache 204 are temporarily saved in IBUF 1 208, IBUF 2 210, IBUF 3 212, and IBUF 4 214 before being decoded at instruction decode and dispatch (IDD) 216. Instructions are retrieved from IDD 216 and read by registers in execution unit 226. Processed instructions are transmitted to storage control unit 222 and then to memory unit 218. In conventional processing, when utilizing IBUF 1 208, IBUF 2 210, IBUF 3 212, or IBUF 4 214 during a loop cycle, instructions are repeatedly fetched from I-cache 204.

Instruction unit 202 may be utilized in simultaneous multi-thread (SMT) mode or single thread mode. A thread is a single sequential flow of control within an instruction set, or program. Programs may have multiple threads and thereby multiple sequential flows of control. When multiple threads are utilized, multiple processes can take place within one cycle. In single thread mode, programs have a single sequential flow of control. However, a single thread is capable of working on a second task when idled by a previous task.

In one embodiment, instruction fetch 206 may simultaneously cycle multiple threads or a single thread through I-cache 204. During instruction cycling, a backwards short loop is detected with no further instruction branches within the loop. Detection may be done by utilizing a branch target address cache to identify the single branch, backwards short loop. To bypass repeat fetching of instructions from I-cache 204 during SMT mode or single thread mode cycling, an additional buffer is added to the instruction buffers within instruction unit 202.
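By way of illustration only, the detection criterion described above (a backward taken branch whose loop body contains no further branches and fits within a buffer) may be sketched in software as follows. The sketch is not the disclosed hardware logic and does not model a branch target address cache; the names Instr, is_single_branch_backward_short_loop and max_len are hypothetical, and instruction addresses are simplified to consecutive integers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record; addresses are simplified to integers."""
    addr: int
    is_branch: bool = False
    target: Optional[int] = None  # branch target address, if this is a branch


def is_single_branch_backward_short_loop(
    instrs: List[Instr], branch: Instr, max_len: int
) -> bool:
    """Return True if `branch` closes a backward loop of at most `max_len`
    instructions whose body contains no branch other than `branch` itself."""
    if not branch.is_branch or branch.target is None:
        return False
    if branch.target > branch.addr:
        return False  # forward branch, so not a backward loop
    # The loop body spans the branch target through the branch itself.
    body = [i for i in instrs if branch.target <= i.addr <= branch.addr]
    if len(body) > max_len:
        return False  # loop too long to be treated as a short loop
    # Only the closing branch may be a branch instruction.
    return all((not i.is_branch) or i.addr == branch.addr for i in body)
```

In this sketch, max_len stands in for the capacity of whichever modified instruction buffer is used; the disclosure does not fix a particular value.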

FIG. 3 illustrates an example I-cache 304, with loop buffer 330, as well as instruction buffers IBUF 1 308, IBUF 2 310, IBUF 3 312, and IBUF 4 314. IDD 316 and branch execution unit 324 assist in processing the instructions after the instructions exit from the IBUFs. In one embodiment, loop buffer 330 is added to process single branch, backwards short loop instructions in single thread or SMT mode. A single branch, backwards short loop enters I-cache 304. Loop buffer (LB) 330 temporarily stores the backwards short loop during instruction cycling. Then, instead of repeat access to I-cache 304, loop buffer 330 is accessed until the loop cycle is complete. Accessing loop buffer 330 for the loop cycle enables I-cache 304 to be available to process new instruction threads. Upon completion of the loop cycle, processing continues to IDD 316 and/or resumes instruction fetch from I-cache 304.

In one embodiment, the length of the single branch, backwards short loop instructions cycling from I-cache 304 is greater than the capacity of loop buffer 330. When the length of the instructions exceeds the capacity of LB 330, then IBUF 1 308, IBUF 2 310, IBUF 3 312, and/or IBUF 4 314 are utilized to assist in storing the loop instructions. Upon completion of the loop, instruction fetch 206 in FIG. 2 resumes processing instructions from I-cache 304.

In one embodiment, during single thread mode, a backwards short loop containing no further branches within the loop is detected. As provided by FIG. 4, two virtual loop buffers (VLB) 430 are added to each of instruction buffers IBUF 1 408, IBUF 2 410, IBUF 3 412, and IBUF 4 414. The backwards short loop instructions are distributed across VLB 430 from IBUF 1 408, IBUF 2 410, IBUF 3 412, and IBUF 4 414. The number of instruction buffers utilized during the cycle is contingent on the number of instructions within the loop. IDD 416 and branch execution unit 424 assist in processing the instructions after the instructions exit from the IBUFs. The single branch, backwards short loop cycles through I-cache 404 and into IBUF 1 408, IBUF 2 410, IBUF 3 412, and IBUF 4 414. In the illustrative embodiments, the virtual loop buffers may hold thirty-two lines of instructions (four instructions per line, with each of four buffers comprising two virtual loop buffers). The length of the backwards short loop may not exceed the capacity of VLBs 430. VLB 430 is repeatedly accessed for the length of the loop cycles. While the loop cycle is ongoing, I-cache 404 may be turned off or may be made accessible to new instructions.

FIG. 5 illustrates one embodiment in which I-cache 504 couples to a single instruction buffer and register file (IBUF) 508. IDD 516 and branch execution unit 524 assist in processing the instructions after the instructions exit from IBUF 508. Instruction addresses are written to register file IBUF 508 at IBUF write 511, and the address entries are locked into IBUF 508 with locked entry 515. The instructions may be decoded (via IBUF address decoder 513) and sent to loop sequence queue 517. During single thread mode, a backwards short loop containing no further branches within the loop is detected. Register file IBUF 508 is loaded with instructions from the loop, and loop sequence queue 517 captures the address and sequence of the loop instructions. IDD 516 selects instructions from register file IBUF 508 as indexed by loop sequence queue 517. Loop sequence queue 517 continues rotating until the last instruction within the loop has been encountered.

FIGS. 6-8 are flow charts illustrating various methods by which the above processes of the illustrative embodiments are completed. Specifically, the method of FIG. 6 relates to the configuration provided by FIG. 3, the method of FIG. 7 relates to the configuration provided by FIG. 4, and the method of FIG. 8 relates to the configuration provided by FIG. 5. Although the methods illustrated in FIGS. 6-8 may be described with reference to components shown in respective FIGS. 3-5, it should be understood that this is merely for convenience and alternate components and/or configurations thereof can be employed when implementing the various methods.

The process of FIG. 6 begins at initiator block 600 when a single branch, backwards short loop is cycled through I-cache 304 (FIG. 3). The loop is detected and analyzed at block 602. A decision is made at block 604 whether the length of the single branch, backwards short loop instructions is greater than the capacity of loop buffer 330 (FIG. 3). If the number of instructions is greater than the loop buffer capacity, then the instruction fetch proceeds to fetch instructions for the loop cycle utilizing I-cache 304, as shown at block 606. Then, the process ends for this embodiment at block 608. If the number of instructions (or the instruction length) is not greater than the capacity of LB 330, then the process proceeds to block 610.

At block 610, the number of loop cycles is determined. The single branch, backwards short loop is then loaded into LB 330 at block 612. After LB 330 is loaded with the instructions for the loop, the logic/utility proceeds to access LB 330 instead of I-cache 304 for the loop instructions, at block 614. At block 616, the IDD rotates instructions processed from LB 330 back to LB 330. A decision is made at block 618 whether the end of the loop cycle has been reached. If all cycles of the loop have not been completed, the process returns to block 614 to fetch the instructions from LB 330. When the loop cycles are complete, the fetch instructions logic/utility returns to fetching instructions from I-cache 304, freeing LB 330 for the next single branch, backwards short loop sequence. The process ends at block 622.
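For purposes of illustration, the decision flow of FIG. 6 may be modeled by the following minimal sketch, which also counts how many accesses go to the I-cache versus the loop buffer. The function name fetch_loop_fig6 and its parameters are hypothetical, and the sketch assumes (as a simplification not stated in the description) that each instruction costs one access and that the loop is loaded into the loop buffer in a single pass through the I-cache before any iteration is dispatched.

```python
from typing import List, Tuple


def fetch_loop_fig6(
    loop_instrs: List[str], loop_count: int, lb_capacity: int
) -> Tuple[List[str], int, int]:
    """Return (dispatched instructions, I-cache accesses, loop-buffer accesses)."""
    dispatched: List[str] = []
    icache_accesses = 0
    lb_accesses = 0

    # Block 604: does the loop exceed the loop buffer capacity?
    if len(loop_instrs) > lb_capacity:
        # Block 606: every iteration is fetched from the I-cache.
        for _ in range(loop_count):
            for ins in loop_instrs:
                icache_accesses += 1
                dispatched.append(ins)
        return dispatched, icache_accesses, lb_accesses

    # Block 612: load the loop into the loop buffer (assumed: one I-cache pass).
    loop_buffer: List[str] = []
    for ins in loop_instrs:
        icache_accesses += 1
        loop_buffer.append(ins)

    # Blocks 614-618: serve every iteration from the loop buffer.
    for _ in range(loop_count):
        for ins in loop_buffer:
            lb_accesses += 1
            dispatched.append(ins)

    return dispatched, icache_accesses, lb_accesses
```

Under these assumptions, a three-instruction loop that cycles four times costs three I-cache accesses when it fits in the loop buffer, compared with twelve when it does not, which is the source of the reduction in repeat I-cache accesses described herein.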

The FIG. 7 flow chart begins with block 700, where virtual loop buffers, such as VLB 430 (FIG. 4), are utilized to reduce the number of instructions fetched from I-cache 404 (FIG. 4). At block 702, a decision is made whether the current mode of instruction cycling is single thread. If the mode is not single thread, the instruction fetch process completes instruction fetch from I-cache 404, as shown at block 704, and the process ends at block 706.

If the mode is single thread, the process continues to block 708, where the single branch, backwards short loop is detected and analyzed. At block 710, a decision is made whether the loop instruction length exceeds the capacity of VLB 430. If the instructions exceed the capacity of VLB 430, the process proceeds to block 704, which indicates that the instruction fetch is completed from I-cache 404. If the instructions are less than or equal to the capacity of VLB 430, then the IBUFs are loaded with the instructions at block 712.

At block 714, after the IBUFs are loaded with the instructions, the instructions are distributed to VLB 430 in each IBUF. The instructions are fetched by IDD from VLB 430 at block 716. A decision is made at block 718 whether the end of the loop cycle has been reached. If the loop cycles are not complete, the process returns to block 716, which shows that the instructions are fetched from VLB 430. When the loop cycles are complete, the instruction fetch process returns to completing instruction fetch from I-cache 404, as shown at block 720. The process ends at block 722.
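The decisions at blocks 702 and 710 and the distribution at block 714 may be sketched as follows. This is an illustrative model only: the function name plan_fetch_fig7 and its parameters are hypothetical, and the in-order fill policy is an assumption, since the description states only that the number of instruction buffers utilized depends on the number of instructions within the loop.

```python
from typing import List, Tuple


def plan_fetch_fig7(
    loop_len: int,
    single_thread: bool,
    vlb_capacity_per_ibuf: int,
    num_ibufs: int = 4,
) -> Tuple[str, List[int]]:
    """Return the fetch source ("icache" or "vlb") and, for "vlb", the number
    of loop instructions placed in the virtual loop buffers of each IBUF."""
    # Block 702: virtual loop buffers are used only in single thread mode.
    if not single_thread:
        return "icache", []

    # Block 710: the loop must not exceed the total VLB capacity.
    if loop_len > vlb_capacity_per_ibuf * num_ibufs:
        return "icache", []

    # Blocks 712-714: distribute the loop across the IBUFs (assumed: in order),
    # using only as many IBUFs as the loop length requires.
    placement: List[int] = []
    remaining = loop_len
    for _ in range(num_ibufs):
        take = min(remaining, vlb_capacity_per_ibuf)
        if take == 0:
            break
        placement.append(take)
        remaining -= take
    return "vlb", placement
```

For example, with a hypothetical capacity of four instructions per IBUF, a ten-instruction loop occupies three of the four IBUFs, while a seventeen-instruction loop falls back to the I-cache.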

The process of FIG. 8 begins at initiator block 800 after a single branch, backwards short loop is cycled through I-cache 504 (FIG. 5). The loop is detected and analyzed at block 802. At block 804, the single branch, backwards short loop instructions are loaded into IBUF 508. The process continues to block 806, at which the number of instructions within the loop is determined. At block 808, the loop instructions, address, and sequence are locked into IBUF 508. The instruction addresses are saved in loop sequence queue 517, at block 810. At block 812, loop sequence queue 517 rotates the loop instructions as required by the instruction fetch. Instructions are exported to IDD 516, at block 814. A decision is made at block 816 whether the end of the loop cycle has been reached. If instructions remain in the loop cycle, the process returns to block 812, where loop sequence queue 517 continues to rotate the instructions. If the end of the cycle has been reached, the process returns to instruction fetch from I-cache 504, at block 818. The process ends at block 820.
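The rotation of the loop sequence queue in FIG. 8 may be illustrated by the following minimal sketch, in which a double-ended queue of register file indices is rotated once per exported instruction. The function name run_loop_fig8 is hypothetical, and the sketch assumes the number of loop cycles is known in advance; the description itself speaks only of detecting the end of the loop cycle.

```python
from collections import deque
from typing import Deque, Dict, List, Set


def run_loop_fig8(loop: List[str], loop_count: int) -> List[str]:
    """Export loop instructions from a locked register file, in the order
    given by a rotating queue of entry indices."""
    # Blocks 804-808: load the loop into the register file and lock entries.
    regfile: Dict[int, str] = dict(enumerate(loop))
    locked: Set[int] = set(regfile)

    # Block 810: save the entry indices (addresses) in the loop sequence queue.
    queue: Deque[int] = deque(regfile.keys())

    exported: List[str] = []
    for _ in range(loop_count * len(loop)):
        idx = queue[0]
        exported.append(regfile[idx])  # Block 814: export to decode/dispatch.
        queue.rotate(-1)               # Block 812: rotate to the next entry.

    # End of loop cycle (assumed): unlock entries before I-cache fetch resumes.
    locked.clear()
    return exported
```

Because the queue holds only indices, the instructions themselves remain in place in the register file for the duration of the loop; only the selection order rotates.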

In the flow charts above, one or more of the methods are embodied in a computer readable medium containing computer readable code such that a series of steps are performed when the computer readable code is executed on a computing device. In some implementations, certain steps of the methods are combined, performed simultaneously or in a different order, or perhaps omitted, without deviating from the spirit and scope of the invention. Thus, while the method steps are described and illustrated in a particular sequence, use of a specific sequence of steps is not meant to imply any limitations on the invention. Changes may be made with regard to the sequence of steps without departing from the spirit or scope of the present invention. Use of a particular sequence is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

Generally, retrieving the instructions from the modified instruction buffers (a) reduces power usage (or energy consumption) by eliminating repeat accesses to the I-cache and (b) increases processor performance by freeing the I-cache for processing new instructions.

As will be further appreciated, the processes in embodiments of the present invention may be implemented using any combination of software, firmware or hardware. As a preparatory step to practicing the invention in software, the programming code (whether software or firmware) will typically be stored in one or more machine readable storage mediums such as fixed (hard) drives, diskettes, optical disks, magnetic tape, semiconductor memories such as ROMs, PROMs, etc., thereby making an article of manufacture in accordance with the invention. The article of manufacture containing the programming code is used by either executing the code directly from the storage device, by copying the code from the storage device into another storage device such as a hard disk, RAM, etc., or by transmitting the code for remote execution using transmission type media such as digital and analog communication links. The methods of the invention may be practiced by combining one or more machine-readable storage devices containing the code according to the present invention with appropriate processing hardware to execute the code contained therein. An apparatus for practicing the invention could be one or more processing devices and storage systems containing or having network access to program(s) coded in accordance with the invention.

Thus, it is important to note that, while an illustrative embodiment of the present invention is described in the context of a fully functional computer (server) system with installed (or executed) software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present invention are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present invention applies equally regardless of the particular type of media used to actually carry out the distribution. By way of example, a non-exclusive list of types of media includes recordable type (tangible) media such as floppy disks, thumb drives, hard disk drives, CD ROMs, DVDs, and transmission type media such as digital and analog communication links.

While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular system, device or component thereof to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. does not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another.

1. A processor comprising: one or more execution units; an instruction cache having instructions stored therein for execution by an execution unit; an instruction unit coupled to the one or more execution units and which provides instructions fetched from the instruction cache to the one or more execution units for execution; logic associated with the instruction unit for: detecting a presence of a single branch, backwards short loop within a stream of fetched instructions; buffering the single branch, backwards short loop in a local loop buffer of the instruction unit, wherein the local loop buffer is a separate buffer component coupled to the instruction cache and to the one or more buffers; retrieving instructions for completing the single branch, backwards short loop from the local loop buffer rather than fetching the instructions from the instruction cache; and recording and tracking a number of execution loops required for the short loop, wherein instructions are retrieved from the local loop buffer during the short loop until the counter logic indicates that all cycles of the short loop have completed.
2. The processor of claim 1, wherein: said instruction unit comprises: an instruction fetch unit for fetching instructions from the instruction cache; one or more instruction buffers within which fetched instructions are initially buffered prior to being sent to the one or more execution units; and an instruction decode and dispatch unit that forwards instructions from the one or more instruction buffers to the one or more execution units; and said logic comprises logic for determining that the single branch, backwards short loop does not contain any other branches within the loop.
3. The processor of claim 1, wherein the local loop buffer is a register file instruction buffer, and said logic further comprises: logic for loading the register file instruction buffer with instructions from the single branch, backwards short loop; logic for locking instruction entries into the register file instruction buffer; logic for writing an instruction address to the register file instruction buffer; when the end of a loop cycle is detected, logic for unlocking and clearing the instruction entries from the register file instruction buffer; and logic for resuming instruction fetch from the instruction cache.
4. The processor of claim 3, wherein the register file is coupled to a loop sequence, said logic further comprising: logic for saving instruction addresses in the loop sequence queue; logic for rotating instruction addresses within the loop sequence queue as the instructions are cycled through an execution sequence; and logic for exporting instruction addresses from the loop sequence queue when the end of the loop cycle is detected.
5. The processor of claim 1, wherein said logic comprises: logic for detecting a scheduled execution of a sequence of instructions that constitutes the single branch, backwards short loop, wherein said logic includes a branch target address cache, which is utilized to identify the single branch, backwards short loop; and logic for automatically storing the instructions of the short loop within one of a loop buffer, a virtual loop buffer, and a register file instruction buffer established as the local loop buffer.
6. The processor of claim 1, said logic further comprising: logic for determining the number of cycles within the short loop; logic for loading the instruction of the short loop into one of a loop buffer, a virtual loop buffer, and a register file instruction buffer; and logic for subsequently fetching instructions for the short loop from the one of the loop buffer, the virtual loop buffer, and the register file instruction buffer hosting the short loop instructions during subsequent instruction processing for the determined number of cycles within the short loop.
7. The processor of claim 1, wherein said logic for detecting further comprises: when a number of cycles within the loop is not complete, logic for fetching instructions from the local loop buffer; logic for detecting the end of the number of cycles; logic for removing short loop instructions from the local loop buffer(s) after the number of cycles have completed; and logic for resuming instruction fetching from the instruction cache when an I-cache fetch condition is detected from among: (a) the end of the number of cycles is detected and (b) the local loop buffer does not contain the loop instructions.
8. A data processing system having a memory coupled to a processor that is configured to operate according to claim 1.
9. In a processor having one or more execution units, an instruction cache having instructions stored therein for execution by an execution unit, and an instruction unit coupled to the one or more execution units and which provides instructions fetched from the instruction cache to the one or more execution units for execution, a method comprising: detecting a presence of a single branch, backwards short loop within a stream of fetched instructions; buffering the single branch, backwards short loop in a local loop buffer of the instruction unit, wherein the local loop buffer is a separate buffer component coupled to the instruction cache and to the one or more buffers; retrieving instructions for completing the single branch, backwards short loop from the local loop buffer rather than fetching the instructions from the instruction cache; and recording and tracking a number of execution loops required for the short loop, wherein instructions are retrieved from the local loop buffer during the short loop until the counter logic indicates that all cycles of the short loop have completed.
10. The method of claim 9, wherein the local loop buffer is a register file instruction buffer coupled to a loop sequence, and said method further comprises: loading the register file instruction buffer with instructions from the single branch, backwards short loop; locking instruction entries into the register file instruction buffer; writing an instruction address to the register file instruction buffer; saving instruction addresses in the loop sequence queue; rotating instruction addresses within the loop sequence queue as the instructions are cycled through an execution sequence; and when the end of a loop cycle is detected: exporting instruction addresses from the loop sequence queue when the end of the loop cycle is detected; unlocking and clearing the instruction entries from the register file instruction buffer; and resuming instruction fetch from the instruction cache.
11. The method of claim 9, wherein said instruction unit includes an instruction fetch unit for fetching instructions from the instruction cache, one or more instruction buffers within which fetched instructions are initially buffered prior to being sent to the one or more execution units; and an instruction decode and dispatch unit that forwards instructions from the one or more instruction buffers to the one or more execution units; and wherein said method further comprises: detecting a scheduled execution of a sequence of instructions that constitutes the single branch, backwards short loop, wherein said logic includes a branch target address cache, which is utilized to identify the single branch, backwards short loop; and automatically storing the instructions of the short loop within one of a loop buffer, a virtual loop buffer, and a register file instruction buffer established as the local loop buffer; determining the number of cycles within the short loop; subsequently fetching instructions for the short loop from the one of the loop buffer, the virtual loop buffer, and the register file instruction buffer hosting the short loop instructions during subsequent instruction processing for the determined number of cycles within the short loop.
12. The method of claim 9, wherein said detecting further comprises: when a number of cycles within the loop is not complete, fetching instructions from the local loop buffer; detecting the end of the number of cycles; removing short loop instructions from the local loop buffer(s) after the number of cycles have completed; and resuming instruction fetching from the instruction cache when an I-cache fetch condition is detected from among: (a) the end of the number of cycles is detected and (b) the local loop buffer does not contain the loop instructions.
13. A computer program product comprising a storage medium and computer program code on the storage medium that executes within a processor and provides software-controllable logic that triggers the functions recited by claim 9.